feat: local review app, LoC+Zucker ingest, corpus audit (198 entries)#37

Merged
shaypal5 merged 9 commits into main from feat/review-app-and-corpus-audit
May 25, 2026

@shaypal5

Summary

This PR captures a full session of data ingestion, corpus curation, and tooling work. It ingests two new sources (28 accepted entries: 27 from LoC, 1 from Zucker), runs two audit passes that leave the corpus at 198 verified entries, and adds a complete local review application.


Data

New sources ingested

| Source | Items reviewed | Accepted | Notes |
|---|---|---|---|
| LoC Hebraic Manuscripts | 166 items / 722 pages | 27 entries (16 items) | PDM-1.0; paginated JSON API |
| OPenn Zucker Ketubah Collection | 288 items | 1 entry | CC-BY-SA 4.0; Hebrew-text panel only; other 287 rejected (Aramaic/non-Hebrew-script) |

Corpus audit passes

Two post-ingest audit passes over all existing entries removed out-of-scope material:

  • Pass 1: 164 entries removed (medieval Geniza fragments, printed pages, non-Hebrew script, misclassified items)
  • Pass 2: 11 entries removed (additional Geniza fragments + over-accepted LoC pages)

Net corpus after all removals: 198 entries, 57 active sources, 228 files, ~382 MiB.

Scan files

Adds scan directories for the 17 accepted sources only (16 LoC + 1 Zucker, ~25 MiB). Rejected/unreviewed scan directories from the same download sessions remain untracked.

PDF → JPEG thumbnails

Five corpus entries that had only a PDF file (no renderable image) were fixed by running pdftoppm -jpeg -r 200 to produce a _thumb.jpg per entry and adding it as role: thumbnail in the files list.
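
A minimal sketch of the thumbnail step, for reviewers who want to reproduce it. The `pdftoppm` flags `-jpeg`, `-r` and `-singlefile` are real; the helper names and the `path` key in the file records are assumptions for illustration and may not match the actual index schema.

```python
from pathlib import Path


def pdftoppm_command(pdf_path: Path, dpi: int = 200) -> list:
    """Build the argv that renders a PDF to <stem>_thumb.jpg next to it."""
    out_prefix = pdf_path.with_name(pdf_path.stem + "_thumb")
    return [
        "pdftoppm", "-jpeg", "-r", str(dpi),
        "-singlefile",  # first page only, no page-number suffix on the output
        str(pdf_path), str(out_prefix),
    ]


def add_thumbnail(entry: dict, thumb_path: str) -> dict:
    """Return a copy of the entry with the thumbnail prepended to its files.

    The "path" key is hypothetical; "role" values follow the PR text.
    """
    others = [f for f in entry.get("files", []) if f.get("role") != "thumbnail"]
    thumb = {"path": thumb_path, "role": "thumbnail"}
    return {**entry, "files": [thumb] + others}
```

Running the command is then `subprocess.run(pdftoppm_command(pdf), check=True)`. Prepending the thumbnail means any viewer that picks the first file gets a renderable image while the original PDF stays in the list.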


Tooling

Ingest scripts

  • scripts/ingest_loc.py — paginates the LoC JSON API, filters pre-1700/printed items, downloads up to 5 pages per item, writes data/review/loc_pending.jsonl
  • scripts/ingest_zucker.py — parses OPenn TEI manifests for the Zucker collection, writes data/review/zucker_pending.jsonl
  • scripts/merge_review.py — promotes approved decisions into entries.jsonl + sources.jsonl; auto-creates per-item source records so the entry-ID → source-ID constraint is always satisfied
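
The LoC ingest flow described above (paginate, filter, cap pages per item) can be sketched roughly as follows. This is not the code in scripts/ingest_loc.py: the `results` and `pagination.next` keys are assumed from the general shape of the LoC JSON responses, and the `year`, `printed` and `pages` fields are hypothetical stand-ins for whatever the script actually inspects.

```python
from typing import Callable, Iterator, Optional

MIN_YEAR = 1700          # pre-1700 items are filtered out
MAX_PAGES_PER_ITEM = 5   # download cap per item


def iter_items(fetch_page: Callable[[str], dict], first_url: str) -> Iterator[dict]:
    """Follow next-page links until the API stops returning one."""
    url: Optional[str] = first_url
    while url:
        payload = fetch_page(url)
        yield from payload.get("results", [])
        url = (payload.get("pagination") or {}).get("next")


def keep_item(item: dict) -> bool:
    """Keep items dated 1700 or later that are not printed material."""
    year = item.get("year")
    if year is None or year < MIN_YEAR:
        return False
    return not item.get("printed", False)


def stage(fetch_page: Callable[[str], dict], first_url: str) -> list:
    """Produce one pending record per page, at most five pages per item."""
    pending = []
    for item in iter_items(fetch_page, first_url):
        if not keep_item(item):
            continue
        pages = item.get("pages", [])[:MAX_PAGES_PER_ITEM]
        for n, page_url in enumerate(pages, start=1):
            pending.append({"id": f"loc__{item['id']}__p{n:04d}", "image_url": page_url})
    return pending
```

Passing `fetch_page` in as a callable keeps the network call out of the filtering logic, so the same function can be exercised against canned responses.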

Local review app (scripts/review_app/)

A Flask app (port 5757) for human review of pending batches and the verified corpus. Run with pip install flask && python scripts/review_app/app.py.

Home page — three-way view toggle:

  • By Writer — one card per author (name, death year, entry count, date range, sample thumbnail)
  • By Source — one card per source (title, provider, entry count, licenses, sample thumbnail)
  • All Entries — flat scrollable grid of all 198 entries

Corpus stats dashboard (top of home page):

  • Key metrics: entry count, source count, writer count, year span, transcript count
  • License breakdown bar: colour-coded segments with legend (counts + %)
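
For reference, the counts and percentages behind a breakdown bar like this can be computed along these lines. This is an illustrative sketch rather than the helper in app.py; the `license` field name and the `"unclear"` fallback label are assumptions.

```python
from collections import Counter


def license_breakdown(entries: list) -> list:
    """Return one row per license, largest first, with count and percentage."""
    counts = Counter(e.get("license") or "unclear" for e in entries)
    total = sum(counts.values())
    return [
        {"license": name, "count": n, "percent": round(100 * n / total, 1)}
        for name, n in counts.most_common()
    ]
```

An empty corpus yields an empty list, so the template can simply skip the bar when there is nothing to show.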

Per-writer / per-source detail pages:

  • Entry cards with rights badge (✓ green / ⚠ yellow), transcript badge, zoomable lightbox
  • Source name shown under each card in writer view; creator names in source view

Corpus audit page (/audit):

  • All corpus entries in a grid; flag for removal + comment per entry
  • Filters: All / Flagged / ⚠ License unclear / With notes
  • Saves to data/review/audit_decisions.json

Global ✎ Actions toggle (nav header, all pages):

  • Off by default — clean browse mode
  • Reveals flag button + comment textarea on every entry card in every view
  • State persists in localStorage; same /api/audit/decide endpoint used everywhere
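
Because every view posts to the same endpoint, a save from one page must not wipe decisions made on another. A minimal sketch of a merge-not-clobber save, assuming decisions are stored as a JSON object keyed by entry ID (the value shape shown in the usage note is hypothetical):

```python
import json
from pathlib import Path


def save_decisions(path: Path, incoming: dict) -> dict:
    """Merge incoming decisions into the file instead of overwriting it."""
    existing = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    existing.update(incoming)  # incoming wins per entry ID; other IDs are kept
    path.write_text(json.dumps(existing, ensure_ascii=False, indent=2), encoding="utf-8")
    return existing
```

For example, `save_decisions(Path("data/review/audit_decisions.json"), {"some_entry_id": {"flag": True, "comment": "printed page"}})` updates one entry and leaves the rest of the file untouched.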

Batch review UI (/review/<batch_id>):

  • Inverted-accept pattern: all entries dim by default, click to accept
  • Progress bar, approved/rejected counts
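
The inverted-accept pattern means a reviewer only ever records the accepted IDs and everything else in the batch is rejected. A sketch of how the decisions and the progress counts could be derived from that; the function names and the `"approved"`/`"rejected"` string values are assumptions.

```python
from collections import Counter


def build_decisions(batch_ids: list, accepted: set) -> dict:
    """Every entry is rejected unless the reviewer clicked it."""
    unknown = accepted - set(batch_ids)
    if unknown:
        raise ValueError(f"accepted IDs not in batch: {sorted(unknown)}")
    return {eid: ("approved" if eid in accepted else "rejected") for eid in batch_ids}


def summarize(decisions: dict) -> dict:
    """Counts for the approved/rejected display."""
    return dict(Counter(decisions.values()))
```

With a 722-entry batch and 27 clicked entries this yields 27 approved and 695 rejected, matching the LoC session totals.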

Documentation

  • AGENTS.md: tightened corpus scope — 18th century minimum, cursive כתב יד only, Yiddish in Hebrew script in scope, Judeo-Arabic out of scope
  • docs/sources/wikimedia_queue.md: updated Wikimedia queue log
  • README.md, exports, NOTICE.md, CITATION.cff, datapackage.json: regenerated from current index

🤖 Generated with Claude Code

shaypal5 and others added 9 commits May 25, 2026 11:55
…er/source views

- Removed 164 entries flagged during corpus audit (209 remain)
- Marked 20 orphaned source records as rejected
- Review app: redesign home with clickable By Writer / By Source card grids
- Review app: new /writer/<slug> and /source/<id> per-group entry views
- Review app: show transcript status badge per entry (status + license if present)
- Review app: audit page now shows transcript info per entry
- Updated exports and README status (209 entries, 57 sources with entries)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- License breakdown bar: segmented colour bar + legend with counts and %
- Key metric blocks: entries, sources, writers, date range, transcript count
- Warn block shown if any entries have unclear rights
- compute_corpus_stats() helper in app.py; license short-names + colour map

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
5 entries stored only a PDF file, which browsers can't display inline
as an <img>. Used pdftoppm at 200 DPI to produce a JPEG thumbnail for
each, added as role=thumbnail prepended in the files list so the review
app picks it up immediately. Original PDF kept as role=original.

Affected entries:
- commons__auerbach_letter_shtenzel_1961__p0001
- commons__bendin_semichah_shtenzel_1933__p0001
- commons__weidenfeld_eruv_letter_1947__p0001
- commons__wosner_halachic_ruling_1981__p0001
- commons__wosner_support_letter_1990__p0001

Validation: 111 sources, 209 entries, 242 files verified.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Actions toggle (✎ Actions button in nav header):
- Hidden by default; one click reveals flag button + comment textarea on
  every entry card across all views (home, group pages, audit)
- State persists in localStorage; CSS-driven via body.show-actions class
- Audit submit button also gated behind the toggle

All Entries view (third tab on home page):
- Flat scrollable grid of all 209 entries, same card style as group pages
- Includes rights/transcript badges, lightbox zoom, action strips
- Browse save bar (sticky bottom) appears when Actions are on; saves to
  the same /api/audit/decide endpoint and merges with existing decisions

Group pages (writer/source):
- Flag/comment action strips added to each entry card (hidden by default)
- Floating browse save bar; loads + merges with existing audit decisions

New API endpoint GET /api/audit/decisions for client-side merge before save

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Removed entries (all flagged via audit UI 2026-05-25):
- commons__bodleian_geniza_ms_heb_d_41_4b__p0001  (Geniza fragment)
- commons__bodleian_geniza_ms_heb_e_39_78b__p0001  (Geniza fragment)
- commons__chief_rabbinate_letter_1921__p0001
- commons__chushiel_letter_geniza__p0001           (Geniza)
- commons__damascus_pentateuch_ms_heb_8_7088__p0001
- commons__geniza_education_ts_k5_13__p0001        (Geniza)
- commons__grodzinski_letter_about_kook__p0001
- commons__halper462_exilarch_genealogy__p0001
- loc__2024422570__p0003, p0004, p0005

9 now-orphaned source records marked rejected.
Corpus: 198 entries across 111 sources (228 files).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- scripts/ingest_loc.py: paginate LoC Hebraic Manuscripts JSON API,
  filter pre-1700/printed items, download up to 5 pages per item,
  write data/review/loc_pending.jsonl
- scripts/ingest_zucker.py: parse OPenn TEI manifests for the Zucker
  Ketubah Collection, write data/review/zucker_pending.jsonl
- scripts/merge_review.py: promote approved review decisions into
  entries.jsonl + sources.jsonl; auto-creates per-item source records
- scripts/review_app/requirements.txt: Flask dependency for review app
- scripts/review_app/templates/batch.html: batch review UI (invert
  accept pattern — all dim by default, click to accept)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- data/review/loc_pending.jsonl: 722 entries staged from LoC Hebraic
  Manuscripts collection; 166 items, up to 5 pages each
- data/review/loc_decisions.json: 722 decisions (27 approved, 695 rejected)
- data/review/zucker_pending.jsonl: 288 entries staged from OPenn
  Zucker Ketubah Collection
- data/review/zucker_decisions.json: 288 decisions (1 approved, 287 rejected)

These files serve as the audit trail for the two review sessions.
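
As a rough illustration of how these decision files feed the merge step, the sketch below promotes approved records and creates a source record for each item on demand. It assumes the source ID is the entry ID without its `__pNNNN` page suffix, which matches the IDs quoted in this PR but is not confirmed as the rule merge_review.py uses; the `source_id` and `status` fields are hypothetical.

```python
import re

_PAGE_SUFFIX = re.compile(r"__p\d{4}$")


def source_id_for(entry_id: str) -> str:
    """Derive the per-item source ID by dropping the page suffix (assumption)."""
    return _PAGE_SUFFIX.sub("", entry_id)


def promote(pending: list, decisions: dict, entries: dict, sources: dict) -> None:
    """Copy approved pending records into entries, creating sources as needed."""
    for rec in pending:
        if decisions.get(rec["id"]) != "approved":
            continue
        if rec["id"] in entries:  # already merged in an earlier run
            continue
        sid = source_id_for(rec["id"])
        sources.setdefault(sid, {"id": sid, "status": "active"})
        entries[rec["id"]] = {**rec, "source_id": sid}
```

Creating the source record inside the same loop is what guarantees that no entry is written without a matching source.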

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds only the scan directories for sources whose entries are in the
corpus index (entries.jsonl). Rejected/unreviewed scan directories
from the same download sessions remain untracked.

Sources included:
- 16 LoC Hebraic Manuscripts items (loc__2018757642 … loc__2023530858)
  accepted from the 166-item LoC review session (27 entries total)
- openn__zucker__ket_z_238 — single accepted Zucker ketubah
  (Hebrew-text panel, CC-BY-SA 4.0)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…eview

- app.py full rewrite: extract _enrich_entries() helper (eliminates 4
  copy-pasted enrichment blocks), add mtime-keyed module-level file
  cache, add load_audit_decisions(live_ids) to filter stale decisions
  at load time, fix path traversal in serve_scan() with resolve()+
  relative_to() check, fix save_decisions() to merge-not-clobber via
  existing.update(incoming), fix source_detail() 404 axis (check
  source_id not in sources, not len(entries)==0), fix review_batch()
  hardcoded source_id with primary_sid=max(set(...),key=count), fix
  walrus-operator double-call in group thumb helpers, remove dead
  imports (re, sys, datetime, timezone)

- templates: slim full-entry JSON blobs to ID-only arrays
  (ENTRIES→ENTRY_IDS, ALL_ENTRIES→ALL_ENTRY_IDS) — eliminates ~588 KB
  of tojson payload per page load; update save loops to iterate IDs

- group.html: remove dead fetch('/api/audit/status') try block in
  saveDecisions() that preceded the real merge fetch

- data/review/audit_decisions.json: clear 175 stale decisions
  (all referenced entries removed from corpus in audit passes)

- merge_review.py: prune stale IDs from audit_decisions.json after
  each batch merge so the file stays in sync with the live index
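
For context on the serve_scan() fix, the resolve() + relative_to() check works roughly like this. The function name and signature are illustrative and are not taken from app.py.

```python
from pathlib import Path
from typing import Optional


def safe_scan_path(scans_root: Path, requested: str) -> Optional[Path]:
    """Return the resolved file path, or None if it escapes scans_root."""
    root = scans_root.resolve()
    candidate = (root / requested).resolve()  # collapses ".." and symlinks
    try:
        candidate.relative_to(root)  # raises ValueError when outside root
    except ValueError:
        return None
    return candidate
```

A route handler would return 404 when this yields None. Resolving before comparing matters: a plain string prefix check on the unresolved path would let `../` segments through.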

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@shaypal5 shaypal5 merged commit 7178d47 into main May 25, 2026
1 check failed
@shaypal5 shaypal5 deleted the feat/review-app-and-corpus-audit branch May 25, 2026 12:20